Bioconductor AnnotationData Packages: http://www.bioconductor.org/packages/release/data/annotation/

AnnotationHub: https://bioconductor.org/packages/AnnotationHub/

License: GPL-3.0

Introduction

Many organism-level (org) packages are readily available on Bioconductor. They provide mappings between a central identifier (e.g. Entrez Gene identifiers) and other identifiers (e.g. Ensembl IDs, RefSeq identifiers, GO identifiers, etc.).

The name of an org package is usually of the form org.<Sp>.<id>.db (e.g. org.Hs.eg.db), where <Sp> is typically a 2-letter abbreviation of the organism (e.g. Hs for Homo sapiens) and <id> is a lower-case abbreviation describing the type of central identifier (e.g. eg for gene identifiers assigned by Entrez Gene, or sgd for gene identifiers assigned by the Saccharomyces Genome Database). A few packages deviate from this pattern (e.g. org.EcK12.eg.db, org.Mxanthus.db). Most Bioconductor annotation packages are updated every 6 months.
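The naming convention can be checked programmatically. The following is a minimal sketch in base R; the helper name parse_org_name and its regular expression are illustrative assumptions, not part of any Bioconductor API. Names that do not follow the org.<Sp>.<id>.db pattern are reported as NA.

```r
# Hypothetical helper: split an org package name into the organism
# abbreviation (<Sp>) and the central-identifier abbreviation (<id>).
parse_org_name <- function(pkg) {
  pattern <- "^org\\.([A-Za-z0-9]+)\\.([a-z]+)\\.db$"
  matched <- grepl(pattern, pkg)
  data.frame(
    package  = pkg,
    organism = ifelse(matched, sub(pattern, "\\1", pkg), NA_character_),
    id_type  = ifelse(matched, sub(pattern, "\\2", pkg), NA_character_),
    stringsAsFactors = FALSE
  )
}

# Package names taken from the Bioconductor org package listing
pkgs <- c("org.Hs.eg.db", "org.At.tair.db", "org.Sc.sgd.db", "org.Mxanthus.db")
parsed <- parse_org_name(pkgs)
parsed
```

The last name has no <id> component, so both parsed fields come back as NA instead of a misleading partial match.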

Start R

cd /ngs/GO-Enrichment-Analysis-Demo

R

Using BiocManager

List available organism-level packages for installation in BiocManager.

BiocManager::available("^org\\.")
##  [1] "org.Ag.eg.db"      "org.At.tair.db"    "org.Bt.eg.db"      "org.Ce.eg.db"     
##  [5] "org.Cf.eg.db"      "org.Dm.eg.db"      "org.Dr.eg.db"      "org.EcK12.eg.db"  
##  [9] "org.EcSakai.eg.db" "org.Gg.eg.db"      "org.Hs.eg.db"      "org.Mm.eg.db"     
## [13] "org.Mmu.eg.db"     "org.Mxanthus.db"   "org.Pf.plasmo.db"  "org.Pt.eg.db"     
## [17] "org.Rn.eg.db"      "org.Sc.sgd.db"     "org.Ss.eg.db"      "org.Xl.eg.db"

Install Arabidopsis org package

As an example, let’s download and install the Arabidopsis thaliana (thale cress) package.

BiocManager::install("org.At.tair.db")
## Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)
## Installing package(s) 'org.At.tair.db'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'cpp11', 'ps'
library(org.At.tair.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
##     clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
##     parSapply, parSapplyLB
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int,
##     pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max, which.min
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with 'browseVignettes()'. To
##     cite Bioconductor, see 'citation("Biobase")', and for packages
##     'citation("pkgname")'.
## 
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:base':
## 
##     expand.grid
org.At.tair.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ARABIDOPSIS_DB
## | ORGANISM: Arabidopsis thaliana
## | SPECIES: Arabidopsis
## | TAIRSOURCENAME: Tair
## | TAIRSOURCEDATE: 2020-Apr01
## | TAIRSOURCEURL: https://www.arabidopsis.org/
## | TAIRGOURL: https://www.arabidopsis.org/download_files/GO_and_PO_Annotations/Gene_Ontology_Annotations/ATH_GO_GOSLIM.txt
## | TAIRGENEURL: https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions
## | TAIRSYMBOLURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190331/gene_aliases_20190402.txt.gz
## | TAIRPATHURL: ftp://ftp.plantcyc.org/Pathways/Data_dumps/PMN14_January2020/pathways/Ara_pathways.20200125
## | TAIRPMIDURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190331/Locus_Published_20190402.txt.gz
## | TAIRCHRURL: https://www.arabidopsis.org/download_files/Maps/seqviewer_data/sv_gene.data
## | TAIRATHURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt
## | TAIRAGURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_AG_array_elements-2010-12-20.txt
## | CENTRALID: TAIR
## | TAXID: 3702
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2020-05-02
## | GOEGSOURCEDATE: 2019-Jul10
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | EGSOURCEDATE: 2019-Jul10
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## 
## Please see: help('select') for usage information

Using AnnotationHub

The method above returns only a limited number of organism-level annotation packages. Many more are available from Bioconductor’s AnnotationHub service.

To search for, download and load annotation resources from the AnnotationHub service, install AnnotationHub if it is not already installed on your machine.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("AnnotationHub")
## Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)
## Installing package(s) 'AnnotationHub'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'cpp11', 'ps'

Create an AnnotationHub object

library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
ah <- AnnotationHub()
## using temporary cache /tmp/RtmpIaAhO4/BiocFileCache
## snapshotDate(): 2020-04-27
# URL for the online AnnotationHub
hubUrl(ah)
## [1] "https://annotationhub.bioconductor.org"

Summary of available records

ah
## AnnotationHub with 50277 records
## # snapshotDate(): 2020-04-27
## # $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/,...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Pan trogl...
## # $rdataclass: GRanges, BigWigFile, TwoBitFile, Rle, EnsDb, OrgDb, ChainFile, TxDb, In...
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based,
## #   maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl,
## #   sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##             title                                                         
##   AH5012  | Chromosome Band                                               
##   AH5013  | STS Markers                                                   
##   AH5014  | FISH Clones                                                   
##   AH5015  | Recomb Rate                                                   
##   AH5016  | ENCODE Pilot                                                  
##   ...       ...                                                           
##   AH83110 | Zonotrichia_albicollis.Zonotrichia_albicollis-1.0.1.ncrna.2bit
##   AH83111 | Zosterops_lateralis_melanops.ASM128173v1.cdna.all.2bit        
##   AH83112 | Zosterops_lateralis_melanops.ASM128173v1.dna_rm.toplevel.2bit 
##   AH83113 | Zosterops_lateralis_melanops.ASM128173v1.dna_sm.toplevel.2bit 
##   AH83114 | Zosterops_lateralis_melanops.ASM128173v1.ncrna.2bit
# Number of resources
length(ah)
## [1] 50277

Query the hub for org records

Search for organism-level packages using the regular expression “^org\\.”.

db <- query(ah, "^org\\.")
df <- mcols(db)
class(df)
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"

Show query results

Show the query results stored in the DFrame.

# Column names
cbind(colnames(df))
##       [,1]                
##  [1,] "title"             
##  [2,] "dataprovider"      
##  [3,] "species"           
##  [4,] "taxonomyid"        
##  [5,] "genome"            
##  [6,] "description"       
##  [7,] "coordinate_1_based"
##  [8,] "maintainer"        
##  [9,] "rdatadateadded"    
## [10,] "preparerclass"     
## [11,] "tags"              
## [12,] "rdataclass"        
## [13,] "rdatapath"         
## [14,] "sourceurl"         
## [15,] "sourcetype"
# Number of org records
nrow(df)
## [1] 1480
# Show df
df[,c("title", "species")]
## DataFrame with 1480 rows and 2 columns
##                                                 title                         species
##                                           <character>                     <character>
## AH79568                           org.Ag.eg.db.sqlite               Anopheles gambiae
## AH79569                         org.At.tair.db.sqlite            Arabidopsis thaliana
## AH79570                           org.Bt.eg.db.sqlite                      Bos taurus
## AH79571                           org.Cf.eg.db.sqlite                Canis familiaris
## AH79572                           org.Gg.eg.db.sqlite                   Gallus gallus
## ...                                               ...                             ...
## AH81959            org.Bathycoccus_prasinos.eg.sqlite            Bathycoccus prasinos
## AH81960        org.Kwoniella_pini_CBS_10737.eg.sqlite        Kwoniella pini_CBS_10737
## AH81961 org.Burkholderia_cepacia_ATCC_25416.eg.sqlite Burkholderia cepacia_ATCC_25416
## AH81962   org.Burkholderia_cepacia_DSM_7288.eg.sqlite   Burkholderia cepacia_DSM_7288
## AH81963   org.Burkholderia_cepacia_LMG_1222.eg.sqlite   Burkholderia cepacia_LMG_1222

Download Felis org package

Let’s search for and retrieve the Felis catus (cat) package.

# Search df with keyword
data.table::as.data.table(df[,c("title", "species")], keep.rownames = TRUE)[grep("Felis", species)]
##         rn                                title                species
## 1: AH80647            org.Felis_catus.eg.sqlite            Felis catus
## 2: AH80648       org.Felis_domesticus.eg.sqlite       Felis domesticus
## 3: AH80649 org.Felis_silvestris_catus.eg.sqlite Felis silvestris_catus
## 4: AH80906       org.Felis_canadensis.eg.sqlite       Felis canadensis
## 5: AH81162         org.Felis_concolor.eg.sqlite         Felis concolor
# Retrieve package with ID "AH80647"
org.Fc.eg.db <- ah[["AH80647"]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
org.Fc.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## 
## Please see: help('select') for usage information
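The keyword search at the start of this section relies on data.table. A minimal base R sketch of the same filter follows. It uses a small stand-in data frame (hub_meta, a hypothetical name) built by hand from rows printed in the query results; the same logical indexing should also work on the DFrame returned by mcols().

```r
# Stand-in for mcols(db)[, c("title", "species")], rebuilt by hand
hub_meta <- data.frame(
  title = c("org.Ag.eg.db.sqlite", "org.At.tair.db.sqlite",
            "org.Felis_catus.eg.sqlite", "org.Felis_concolor.eg.sqlite"),
  species = c("Anopheles gambiae", "Arabidopsis thaliana",
              "Felis catus", "Felis concolor"),
  row.names = c("AH79568", "AH79569", "AH80647", "AH81162"),
  stringsAsFactors = FALSE
)

# Keep the rows whose species contains the keyword
felis <- hub_meta[grepl("Felis", hub_meta$species), c("title", "species")]
felis

# An exact species match gives the record ID to use with ah[["..."]]
cat_id <- rownames(hub_meta)[hub_meta$species == "Felis catus"]
cat_id
```

Matching the species exactly avoids picking up related records such as Felis concolor when only one organism is wanted.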

Show record status

recordStatus(ah, "AH80647")
##    record status  dateadded
## 1 AH80647 Public 2020-04-27

Load from local cache

After an annotation package has been retrieved, it is stored in the local AnnotationHub cache. You can use it again without having to download it a second time.

# Location of the local AnnotationHub cache
hubCache(ah)
## [1] "/home/ihsuan/.cache/AnnotationHub"
# Load from cache
org.Fc.eg.db <- ah[["AH80647"]]
## loading from cache

Clear local cache

You can use the removeCache function to remove the local AnnotationHub database and all related resources.

removeCache(ah, ask = TRUE)

Discover org db objects

columns

Shows which kinds of data can be returned for the AnnotationDb object.

Both OrgDb objects contain Gene Ontology mapping information (the GO, GOALL, ONTOLOGY and ONTOLOGYALL columns).

columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"
columns(org.Fc.eg.db)
##  [1] "ACCNUM"      "ALIAS"       "CHR"         "ENSEMBL"     "ENTREZID"    "EVIDENCE"   
##  [7] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"   
## [13] "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"

keytypes

Shows which columns can be used as keys.

keytypes(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"
keytypes(org.Fc.eg.db)
##  [1] "ACCNUM"      "ALIAS"       "ENSEMBL"     "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
##  [7] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL"
## [13] "PMID"        "REFSEQ"      "SYMBOL"
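The columns and keytypes of an object are not necessarily identical. As a small sketch, the two vectors below are copied by hand from the org.Fc.eg.db output shown in this section (the variable names are illustrative), and setdiff() reports the columns that can be retrieved but not used as a key.

```r
# Values copied from the columns() and keytypes() output for org.Fc.eg.db
fc_columns <- c("ACCNUM", "ALIAS", "CHR", "ENSEMBL", "ENTREZID", "EVIDENCE",
                "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")
fc_keytypes <- c("ACCNUM", "ALIAS", "ENSEMBL", "ENTREZID", "EVIDENCE",
                 "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                 "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")

# Columns that can be returned but cannot serve as a keytype
retrieve_only <- setdiff(fc_columns, fc_keytypes)
retrieve_only
## [1] "CHR"
```

With the objects loaded, the same comparison can be written as setdiff(columns(org.Fc.eg.db), keytypes(org.Fc.eg.db)).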

keys

Returns values (or keys) that can be expected for a given keytype. By default it will return the primary keys for the database.

head(keys(org.At.tair.db), 10)  # Primary keys
##  [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
##  [8] "AT1G01073" "AT1G01080" "AT1G01090"
head(keys(org.At.tair.db, keytype = "SYMBOL"), 10)
##  [1] "ANAC001" "NAC001"  "NTL10"   "ARV1"    "NGA3"    "ASU1"    "ATDCL1"  "CAF"    
##  [9] "DCL1"    "EMB60"
head(keys(org.At.tair.db, keytype = "GO"), 10)
##  [1] "GO:0003700" "GO:0005634" "GO:0006355" "GO:0003674" "GO:0005739" "GO:0005783"
##  [7] "GO:0005794" "GO:0006665" "GO:0009507" "GO:0016125"
head(keys(org.Fc.eg.db), 10)    # Primary keys
##  [1] "414734" "445455" "448843" "492297" "492308" "493648" "493649" "493650" "493651"
## [10] "493652"
head(keys(org.Fc.eg.db, keytype = "SYMBOL"), 10)
##  [1] "A1BG"    "A1CF"    "A2M"     "A2ML1"   "A3GALT2" "A4GALT"  "A4GNT"   "AAAS"   
##  [9] "AACS"    "AADAC"
head(keys(org.Fc.eg.db, keytype = "GO"), 10)
##  [1] "GO:0000002" "GO:0000003" "GO:0000012" "GO:0000014" "GO:0000015" "GO:0000027"
##  [7] "GO:0000028" "GO:0000030" "GO:0000033" "GO:0000035"

select

Retrieves data as a data.frame for the given keys, columns and keytype arguments.

Ex1: Given TAIR IDs, retrieve SYMBOL

myKeys <- head(keys(org.At.tair.db, keytype = "TAIR"), 10)
myKeys
##  [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
##  [8] "AT1G01073" "AT1G01080" "AT1G01090"
select(org.At.tair.db, keys = myKeys, columns = "SYMBOL", keytype = "TAIR")
## 'select()' returned 1:many mapping between keys and columns
##         TAIR   SYMBOL
## 1  AT1G01010  ANAC001
## 2  AT1G01010   NAC001
## 3  AT1G01010    NTL10
## 4  AT1G01020     ARV1
## 5  AT1G01030     NGA3
## 6  AT1G01040     ASU1
## 7  AT1G01040   ATDCL1
## 8  AT1G01040      CAF
## 9  AT1G01040     DCL1
## 10 AT1G01040    EMB60
## 11 AT1G01040    EMB76
## 12 AT1G01040     SIN1
## 13 AT1G01040     SUS1
## 14 AT1G01050   AtPPa1
## 15 AT1G01050     PPa1
## 16 AT1G01060      LHY
## 17 AT1G01060     LHY1
## 18 AT1G01070 UMAMIT28
## 19 AT1G01073     <NA>
## 20 AT1G01080     <NA>
## 21 AT1G01090   PDH-E1

Ex2: Given SYMBOL, retrieve ENTREZID

myKeys <- c("CCA1", "LHY", "PRR7", "PRR9") # morning loop components
select(org.At.tair.db, keys = myKeys, columns = "ENTREZID", keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL ENTREZID
## 1   CCA1   819296
## 2    LHY   839341
## 3   PRR7   831793
## 4   PRR9   819292
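When the mapping is 1:1, the result can be turned into a named vector for quick lookups. This is a minimal sketch on a hand-built copy of the table above; the name symbol2entrez is illustrative, and the Entrez IDs are assumed to be stored as character strings.

```r
# The 1:1 select() result, rebuilt by hand
res <- data.frame(
  SYMBOL = c("CCA1", "LHY", "PRR7", "PRR9"),
  ENTREZID = c("819296", "839341", "831793", "819292"),
  stringsAsFactors = FALSE
)

# Named vector: names are symbols, values are Entrez Gene IDs
symbol2entrez <- setNames(res$ENTREZID, res$SYMBOL)

# Look up a subset of symbols by name
symbol2entrez[c("LHY", "PRR9")]
```

This shortcut is only safe for 1:1 results; with repeated symbols, lookup by name returns only the first match.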

Ex3: Given ENSEMBL IDs, retrieve SYMBOL

myKeys <- head(keys(org.Fc.eg.db, keytype = "ENSEMBL"), 10)
myKeys
##  [1] "ENSFCAG00000000001" "ENSFCAG00000000007" "ENSFCAG00000000015" "ENSFCAG00000000022"
##  [5] "ENSFCAG00000000023" "ENSFCAG00000000024" "ENSFCAG00000000028" "ENSFCAG00000000029"
##  [9] "ENSFCAG00000000030" "ENSFCAG00000000031"
select(org.Fc.eg.db, keys = myKeys, columns = "SYMBOL", keytype = "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##               ENSEMBL     SYMBOL
## 1  ENSFCAG00000000001     INTS6L
## 2  ENSFCAG00000000007      HMGCR
## 3  ENSFCAG00000000015     CEP192
## 4  ENSFCAG00000000022    RASGRP1
## 5  ENSFCAG00000000023      GPR39
## 6  ENSFCAG00000000024      LYPD1
## 7  ENSFCAG00000000028       RCN3
## 8  ENSFCAG00000000029       APOO
## 9  ENSFCAG00000000030  CXHXorf58
## 10 ENSFCAG00000000031 CB1H4orf19

Ex4: Given SYMBOL, retrieve ENSEMBL and ENTREZID

myKeys <- c("ASIP", "MC1R") # coat color patterns
select(org.Fc.eg.db, keys = myKeys, columns = c("ENSEMBL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL            ENSEMBL ENTREZID
## 1   ASIP ENSFCAG00000011037   492297
## 2   MC1R ENSFCAG00000003798   493917

Session information

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4/lib/libopenblasp-r0.3.10.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
## [9] base     
## 
## other attached packages:
##  [1] AnnotationHub_2.20.1  BiocFileCache_1.12.1  dbplyr_1.4.4         
##  [4] org.At.tair.db_3.11.4 AnnotationDbi_1.50.3  IRanges_2.22.2       
##  [7] S4Vectors_0.26.1      Biobase_2.48.0        BiocGenerics_0.34.0  
## [10] knitr_1.29           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5                    later_1.1.0.1                
##  [3] compiler_4.0.2                pillar_1.4.6                 
##  [5] BiocManager_1.30.10           tools_4.0.2                  
##  [7] digest_0.6.25                 bit_4.0.4                    
##  [9] RSQLite_2.2.0                 evaluate_0.14                
## [11] memoise_1.1.0                 tibble_3.0.3                 
## [13] lifecycle_0.2.0               pkgconfig_2.0.3              
## [15] rlang_0.4.7                   shiny_1.5.0                  
## [17] DBI_1.1.0                     curl_4.3                     
## [19] yaml_2.2.1                    xfun_0.16                    
## [21] fastmap_1.0.1                 httr_1.4.2                   
## [23] stringr_1.4.0                 dplyr_1.0.1                  
## [25] rappdirs_0.3.1                generics_0.0.2               
## [27] vctrs_0.3.2                   tidyselect_1.1.0             
## [29] bit64_4.0.2                   data.table_1.13.0            
## [31] glue_1.4.1                    R6_2.4.1                     
## [33] rmarkdown_2.3                 purrr_0.3.4                  
## [35] blob_1.2.1                    magrittr_1.5                 
## [37] promises_1.1.1                htmltools_0.5.0              
## [39] ellipsis_0.3.1                assertthat_0.2.1             
## [41] xtable_1.8-4                  mime_0.9                     
## [43] interactiveDisplayBase_1.26.3 httpuv_1.5.4                 
## [45] stringi_1.4.6                 BiocVersion_3.11.1           
## [47] crayon_1.3.4